Abstract

Model combination techniques have consistently shown state-of-the-art
performance across multiple tasks, including syntactic parsing. However,
they dramatically increase runtime and can be difficult to employ in
practice. We demonstrate that applying constituency model combination
techniques to n-best lists instead of n different parsers results in
significant parsing accuracy improvements. Parses are weighted by their
probabilities and combined using an adapted version of Sagae and Lavie
(2006). These accuracy gains come with marginal computational costs and
are obtained on top of existing parsing techniques such as discriminative
reranking and self-training, resulting in state-of-the-art accuracy:
92.6% on WSJ section 23. On out-of-domain corpora, accuracy is improved
by 0.4% on average. We empirically confirm that six well-known n-best
parsers benefit from the proposed methods across six domains.

1 Introduction 

Researchers have proposed many algorithms to combine parses from multiple
parsers into one final parse (Henderson and Brill, 1999; Zeman and
Žabokrtský, 2005; Sagae and Lavie, 2006; Nowson and Dale, 2007; Fossum
and Knight, 2009; Petrov, 2010; Johnson and Ural, 2010; Huang et al.,
2010; McDonald and Nivre, 2011; Shindo et al., 2012; Narayan and Cohen,
2015). These new parses are substantially better than the originals:
Zhang et al. (2009) combine outputs from multiple n-best parsers and
achieve an F1 of 92.6% on the WSJ test set, a 0.5% improvement over their
best n-best parser. Model combination approaches tend to fall into the
following categories: hybridization, where multiple parses are combined
into a single parse; switching, which picks a single parse according to
some criteria (usually a form of voting); grammar merging, where grammars
are combined before or during parsing; and stacking, where one parser
sends its prediction to another at runtime. All of these have at least
one of the caveats that (1) overall computation is increased and runtime
is determined by the slowest parser and (2) using multiple parsers
increases the system complexity, making it more difficult to deploy in
practice. In this paper, we describe a simple hybridization extension
("fusion") which obtains much of hybridization's benefits while using
only a single n-best parser and minimal extra computation. Our method
treats each parse in a single parser's n-best list as a parse from n
separate parsers. We then adapt parse combination methods by Henderson
and Brill (1999), Sagae and Lavie (2006), and Fossum and Knight (2009) to
fuse the constituents from the n parses into a single tree. We
empirically show that six n-best parsers benefit from parse fusion across
six domains, obtaining state-of-the-art results. These improvements are
complementary to other techniques such as reranking and self-training.
Our best system obtains an F1 of 92.6% on WSJ section 23, a score
previously obtained only by combining the outputs from multiple parsers.
A reference implementation is available as part of BLLIP Parser at
http://github.com/BLLIP/bllip-parser/

2 Fusion

Henderson and Brill (1999) propose a method to combine trees from m
parsers in three steps: populate a chart with constituents along with the
number of times they appear in the trees; remove any constituent with
count less than m/2 from the chart; and finally create a final tree with
all the remaining constituents. Intuitively their method
constructs a tree with constituents from the majority of the trees, which
boosts precision significantly. Henderson and Brill (1999) show that this
process is guaranteed to produce a valid tree. Sagae and Lavie (2006)
generalize this work by reparsing the chart populated with constituents
whose counts are above a certain threshold. By adjusting the threshold on
development data, their generalized method balances precision and recall.
Fossum and Knight (2009) further extend this line of work by using n-best
lists from multiple parsers and combining productions in addition to
constituents. Their model assigns sums of joint probabilities of
constituents and parsers to constituents. Surprisingly, exploiting n-best
trees does not lead to large improvement over combining 1-best trees in
their experiments.

Our extension takes the n-best trees from a parser as if they are 1-best
parses from n parsers, then follows Sagae and Lavie (2006). Parses are
weighted by the estimated probabilities from the parser. Given n trees
and their weights, the model computes a constituent's weight by summing
weights of all trees containing that constituent. Concretely, the weight
of a constituent spanning from the ith word to the jth word with label
$\ell$ is

$$c_\ell(i \to j) = \sum_{k=1}^{n} W(k)\, C_\ell^k(i \to j) \qquad (1)$$

where $W(k)$ is the weight of the kth tree and $C_\ell^k(i \to j)$ is one
if a constituent with label $\ell$ spanning from i to j is in the kth
tree, zero otherwise. After populating the chart with constituents and
their weights, the model throws out constituents with weights below a set
threshold t. Using the threshold t = 0.5 emulates the method of Henderson
and Brill (1999) in that it constructs the tree with the constituents in
the majority of the trees. The CYK parsing algorithm is applied to the
chart to produce the final tree.

Note that populating the chart is linear in the number of words and the
chart contains substantially fewer constituents than charts in well-known
parsers, making this a fast procedure.
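
As a rough illustration of Equation 1 and the thresholding step, the
following Python sketch populates a chart from weighted trees and prunes
it. The representation of a tree as a set of (label, start, end) tuples,
the function names, and the toy trees and weights are assumptions made for
this example and are not taken from the reference implementation. The CYK
reparsing step is omitted.

```python
from collections import defaultdict

# Assumption for illustration: a constituent is a (label, start, end) tuple
# and a tree is the set of its constituents.


def constituent_weights(trees, weights):
    """Sum, for each constituent, the weights of all trees containing it."""
    chart = defaultdict(float)
    for tree, weight in zip(trees, weights):
        for constituent in tree:
            chart[constituent] += weight
    return dict(chart)


def prune_chart(chart, threshold):
    """Keep only constituents whose weight is not below the threshold t."""
    return {c: w for c, w in chart.items() if w >= threshold}


# Toy 3-best list over a three-word sentence, with normalized weights W(k).
trees = [
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)},
    {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)},
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)},
]
weights = [0.5, 0.3, 0.2]

chart = constituent_weights(trees, weights)
kept = prune_chart(chart, threshold=0.45)
```

In this toy case the constituents that survive pruning are exactly those
of the first tree. With equal weights and t = 0.5, the same two functions
would carry out constituent-level majority voting.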


2.1 Score distribution over trees 

We assume that n-best parsers provide trees along with some kind of
scores (often probabilities or log probabilities). Given these scores, a
natural way to obtain weights is to normalize the probabilities. However,
parsers do not always provide accurate estimates of parse quality. We may
obtain better performance from parse fusion by altering this distribution
and passing scores through a nonlinear function, $f(\cdot)$. The kth parse
is weighted:

$$W(k) = \frac{f(\mathrm{SCORE}(k))}{\sum_{i=1}^{n} f(\mathrm{SCORE}(i))} \qquad (2)$$

where SCORE(i) is the score of the ith tree.¹ We explore the family of
functions $f(x) = x^\beta$, which can smooth or sharpen the score
distributions. This includes a tunable parameter,
$\beta \in \mathbb{R}_0^+$:

$$W(k) = \frac{\mathrm{SCORE}(k)^\beta}{\sum_{i=1}^{n} \mathrm{SCORE}(i)^\beta} \qquad (3)$$

Employing β < 1 flattens the score distribution over n-best trees and
helps over-confident parsers. On the other hand, having β > 1 skews the
distribution toward parses with higher scores and helps under-confident
parsers. Note that setting β = 0 weights all parses equally and results
in majority voting at the constituent level. We leave developing other
nonlinear functions for fusion as future work.
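
Equation 3 can be sketched in a few lines of Python. The version below
assumes the parser returns natural-log probabilities and works in log
space, which is equivalent to exponentiating first but avoids underflow
for long sentences. The function name and the example scores are
assumptions for illustration only.

```python
import math


def fusion_weights(log_probs, beta=1.0):
    """Return W(k) = p_k**beta / sum_i p_i**beta from log probabilities."""
    if not log_probs:
        return []
    scaled = [beta * lp for lp in log_probs]
    shift = max(scaled)  # subtracting the maximum keeps exp() well-behaved
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-best list with probabilities 0.6, 0.3 and 0.1.
log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]

uniform = fusion_weights(log_probs, beta=0.0)    # every parse weighted equally
plain = fusion_weights(log_probs, beta=1.0)      # normalized probabilities
sharpened = fusion_weights(log_probs, beta=2.0)  # more mass on the top parse
```

With β = 2 the top parse's weight in this example rises from 0.6 to
0.36/0.46, roughly 0.78, which is the sharpening effect described above.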


3 Experiments 


Corpora: Parse fusion is evaluated on British National Corpus (BNC),
Brown, GENIA, Question Bank (QB), Switchboard (SB) and Wall Street
Journal (WSJ) (Foster and van Genabith, 2008; Francis and Kučera, 1989;
Kim et al., 2003; Judge et al., 2006; Godfrey et al., 1992; Marcus et
al., 1993). WSJ is used to evaluate in-domain parsing; the remaining five
are used for out-of-domain. For divisions, we use tune and test splits
from Bacchiani et al. (2006) for Brown, McClosky's test PMIDs² for GENIA,
Stanford's test split³ for QuestionBank, and articles 4000-4153 for
Switchboard.

Parsers: The methods are applied to six widely used n-best parsers:
Charniak (2000), Stanford (Klein and Manning, 2003), BLLIP (Charniak and
Johnson, 2005), Self-trained BLLIP (McClosky et al., 2006),⁴ Berkeley
(Petrov et al., 2006), and Stanford RNN (Socher et al., 2013). The list
of parsers and their accuracies on the WSJ test set is reported in
Table 1. We convert to Stanford Dependencies (basic dependencies, version
3.3.0) and provide dependency metrics (UAS, LAS) as well.

Supervised parsers are trained on the WSJ training set (sections 2-21)
and use section 24 for development. Self-trained BLLIP was self-trained
using two million sentences from Gigaword and Stanford RNN uses word
embeddings trained from larger corpora.

Parser               F1     UAS    LAS
Stanford             85.4   90.0   87.3
Stanford RNN⁵        89.6   92.9   90.4
Berkeley             90.0   93.5   91.2
Charniak             89.7   93.2   90.8
BLLIP                91.5   94.4   92.0
Self-trained BLLIP   92.2   94.7   92.2

Table 1: Six parsers along with their 1-best F1 scores, unlabeled
attachment scores (UAS) and labeled attachment scores (LAS) on WSJ
section 23.

¹ For parsers that return log probabilities, we turn these into
probabilities first.
² http://nlp.stanford.edu/~mcclosky/biomedical.html
³ http://nlp.stanford.edu/data/QuestionBank-Stanford.shtml
⁴ Using the 'WSJ+Gigaword-v2' BLLIP model.

Parameter tuning: There are three parameters for our fusion process: the
size of the n-best list (2 ≤ n ≤ 50), the smoothing exponent from
Section 2.1 (β ∈ [0.5, 1.5] with 0.1 increments), and the minimum
threshold for constituents (t ∈ [0.2, 0.7] with 0.01 increments). We use
grid search to tune these parameters for two separate scenarios. When
parsing WSJ (in-domain), we tune parameters on WSJ section 24. For the
remaining corpora (out-of-domain), we use the tuning section from Brown.
Each parser is tuned separately, resulting in 12 different tuning
scenarios. In practice, though, in-domain and out-of-domain tuning
regimes tend to pick similar settings within a parser. Across parsers,
settings are also fairly similar (n is usually 30 or 40, t is usually
between 0.45 and 0.5). While the smoothing exponent varies from 0.5 to
1.3, setting β = 1 does not significantly hurt accuracy for most parsers.
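
A grid search over these three parameters might look like the sketch
below. The β and t grids follow the ranges and increments stated above.
The candidate values for n are an assumption, since only the range is
given, and `toy_evaluate` is a hypothetical stand-in for "fuse the
development set with these settings and return F1"; it does not reproduce
any result from the experiments.

```python
import itertools


def frange(start, stop, step):
    """Inclusive list of floats built from integer steps to avoid drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


def grid_search(evaluate, n_values, beta_values, t_values):
    """Return the (n, beta, t) setting with the highest evaluate() score."""
    best_score, best_config = float("-inf"), None
    for n, beta, t in itertools.product(n_values, beta_values, t_values):
        score = evaluate(n, beta, t)
        if score > best_score:
            best_score, best_config = score, (n, beta, t)
    return best_config, best_score


n_values = [2, 5, 10, 20, 30, 40, 50]  # assumed candidates within 2..50
beta_values = frange(0.5, 1.5, 0.1)    # 11 values
t_values = frange(0.2, 0.7, 0.01)      # 51 values


def toy_evaluate(n, beta, t):
    # Hypothetical objective that peaks at n = 30, beta = 1.1, t = 0.47.
    return -(abs(n - 30) / 100 + abs(beta - 1.1) + abs(t - 0.47))


best_config, best_score = grid_search(toy_evaluate, n_values, beta_values, t_values)
```

In a real run, `evaluate` would dominate the cost, so caching the n-best
lists once per sentence and reusing them across (β, t) settings is a
natural design choice.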

To study the effects of these parameters, Figure 1 shows three slices of
the tuning surface for BLLIP parser on WSJ section 24 around the optimal
settings (n = 30, β = 1.1, t = 0.47). In each graph, one of the
parameters is varied while the others are held constant. Increasing
n-best size improves accuracy until about n = 30 where there seems to be
sufficient diversity. For BLLIP, the smoothing exponent (β) is best set
around 1.0, with accuracy falling off if the value deviates too much.
Finally, the threshold parameter is empirically optimized a little below
t = 0.5 (the value suggested by Henderson and Brill (1999)). Since score
values are normalized, this means that a constituent needs roughly half
the "score mass" in order to be included in the chart. Varying the
threshold changes the precision/recall balance since a high threshold
adds only the most confident constituents to the chart (Sagae and Lavie,
2006).

Parser                       WSJ    Brown
BLLIP                        90.6   85.7
+ Fusion                     91.0   86.0
+ Majority voting (β = 0)    89.1   83.8
+ Rank-based weighting       89.3   84.1

Table 2: F1 of a baseline parser, fusion, and baselines on development
sections of corpora (WSJ section 24 and Brown tune).

⁵ Socher et al. (2013) report an F1 of 90.4%, but this is the result of
using an ensemble of two RNNs (p.c.). We use a single RNN in this work.

Baselines: Table 2 gives the accuracy of fusion and baselines for BLLIP
on the development corpora. Majority voting sets n = 50, β = 0, t = 0.5,
giving all parses equal weight and resulting in constituent-level
majority voting. We explore a rank-based weighting which ignores parse
probabilities and weights parses only using the rank:
$W_{\mathrm{rank}}(k) = 1/2^k$. These show that accurate parse-level
scores are critical for good performance.
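
The two baseline weighting schemes are simple enough to write out
directly. The sketch below is illustrative only: the function names are
assumptions, and whether the rank-based weights were renormalized before
thresholding is not stated, so they are left as given by the formula.

```python
def rank_weights(n):
    """W_rank(k) = 1 / 2**k for ranks k = 1..n (rank 1 is the top parse)."""
    return [1.0 / (2 ** k) for k in range(1, n + 1)]


def uniform_weights(n):
    """Equal weights, i.e. the beta = 0 case used for majority voting."""
    return [1.0 / n] * n


rank_example = rank_weights(3)          # 1/2, 1/4, 1/8
majority_example = uniform_weights(50)  # 50 parses, each weighted 1/50
```

Either list can stand in for W(k) when summing constituent weights. Note
that the unnormalized rank weights sum to 1 - 2^(-n) rather than exactly
one.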


Final evaluation: Table 3 gives our final results for all parsers across
all domains. Results in blue are significant at p < 0.01 using a
randomized permutation test. Fusion generally improves F1 for in-domain
and out-of-domain parsing by a significant margin. For the self-trained
BLLIP parser, in-domain F1 increases by 0.4% and out-of-domain F1
increases by 0.4% on average. Berkeley parser obtains the smallest gains
from fusion since Berkeley's n-best lists are ordered by factors other
than probabilities. As a result, the probabilities from Berkeley can
mislead the fusion process.
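
The details of the significance test are not spelled out here. As one
common way to run a randomized permutation test on F1, the sketch below
randomly swaps the two systems' per-sentence bracket counts and checks how
often the shuffled F1 difference is at least as large as the observed one.
The per-sentence (matched, gold, predicted) tuples, the function names, the
trial count, and the toy counts are all assumptions for illustration.

```python
import random


def f1(matched, gold, predicted):
    """Bracket F1 from aggregate counts; defined as 0 when nothing matches."""
    if matched == 0 or gold == 0 or predicted == 0:
        return 0.0
    precision = matched / predicted
    recall = matched / gold
    return 2 * precision * recall / (precision + recall)


def corpus_f1(counts):
    """counts: one (matched, gold, predicted) tuple per sentence."""
    return f1(sum(c[0] for c in counts),
              sum(c[1] for c in counts),
              sum(c[2] for c in counts))


def permutation_test(counts_a, counts_b, trials=10000, seed=0):
    """Two-sided paired randomization test on the corpus-level F1 difference."""
    rng = random.Random(seed)
    observed = abs(corpus_f1(counts_a) - corpus_f1(counts_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(counts_a, counts_b):
            if rng.random() < 0.5:  # swap this sentence between the systems
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(corpus_f1(shuffled_a) - corpus_f1(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy per-sentence counts for a baseline and a fused system (made up).
baseline = [(8, 10, 10), (5, 10, 9), (7, 8, 8), (6, 9, 10)]
fused = [(9, 10, 10), (6, 10, 9), (7, 8, 8), (7, 9, 10)]
p_value = permutation_test(baseline, fused, trials=1000)
```

Adding one to both numerator and denominator keeps the estimated p-value
strictly positive, which is a standard precaution for Monte Carlo tests.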


We also compare against model combination using our reimplementation of
Sagae and Lavie (2006). For these results, all six parsers were given
equal weight. The threshold was set to 0.42 to optimize model combination
F1 on development data (similar to Setting 2 for constituency parsing in
Sagae and Lavie (2006)). Model combination performs better than fusion on
BNC and GENIA, but surprisingly fusion outperforms model combination on
three of the six domains (though usually not by a significant margin).
With further tuning (e.g., specific weights for each constituent-parser
pair), the benefits from model combination should increase.

[Figure 1: three panels with F1 (90.0 to 92.0) on the y-axis and, on the
x-axes, n-best list size (2 to 50), β (smoothing exponent, 0.5 to 1.5),
and t (threshold, 0.2 to 0.7).]

Figure 1: Tuning parameters independently for BLLIP and their impact on
F1 for WSJ section 24 (solid purple line). For each graph, non-tuned
parameters were set at the optimal configuration for BLLIP (n = 30,
β = 1.1, t = 0.47). The dashed grey line represents the 1-best baseline
at 90.6% F1.

Parser               BNC         Brown       GENIA       SB          QB          WSJ
Stanford             78.4/79.6   80.7/81.6   73.1/73.9   67.0/67.9   78.6/80.0   85.4/86.2
Stanford RNN         82.0/82.3   84.0/84.3   76.0/76.2   70.7/71.2   82.9/83.6   89.6/89.7
Berkeley             82.3/82.9   84.6/84.6   76.4/76.6   74.5/75.1   86.5/85.9   90.0/90.3
Charniak             82.5/83.0   83.9/84.6   74.8/75.7   76.8/77.6   85.6/86.3   89.7/90.1
BLLIP                84.1/84.7   85.8/86.0   76.7/77.1   79.2/79.5   88.1/88.9   91.5/91.7
Self-trained BLLIP   85.2/85.8   87.4/87.7   77.8/78.2   80.9/81.7   89.5/89.5   92.2/92.6
Model combination    86.6        87.7        79.4        80.9        89.3        92.5

Table 3: Evaluation of the constituency fusion method on six parsers
across six domains. x/y indicates the F1 from the baseline parser (x) and
the baseline parser with fusion (y), respectively. Blue indicates a
statistically significant difference between fusion and its baseline
parser (p < 0.01).


Multilingual evaluation: We evaluate fusion with the Berkeley parser on
Arabic (Maamouri et al., 2004; Green and Manning, 2010), French (Abeillé
et al., 2003), and German (Brants et al., 2002) from the SPMRL 2014
shared task (Seddah et al., 2014) but did not observe any improvement. We
suspect this has to do with the same ranking issues seen in the Berkeley
Parser's English results. On the other hand, fusion helps the parser of
Narayan and Cohen (2015) on the German NEGRA treebank (Skut et al., 1997)
to improve from 80.9% to 82.4%.


Runtime: As discussed in Section 2, fusion's runtime overhead is minimal.
Reranking parsers (e.g., BLLIP and Stanford RNN) already need to perform
n-best decoding as input for the reranker. Using a somewhat optimized
implementation of fusion in C++, the overhead over BLLIP parser is less
than 1%.


Discussion: Why does fusion help? It is possible that a parser's n-best
list and its scores act as a weak approximation to the full parse forest.
As a result, fusion seems to provide part of the benefits seen in forest
reranking (Huang, 2008).


Results from Fossum and Knight (2009) imply that fusion and model
combination might not be complementary. Both n-best lists and additional
parsers provide syntactic diversity. While additional parsers provide
greater diversity, n-best lists from common parsers are varied enough to
provide improvements for parse hybridization.

We analyzed how often fusion produces completely novel trees. For BLLIP
on WSJ section 24, this only happens about 11% of the time. Fusion picks
the 1-best tree 72% of the time. This means that for the remaining 17%,
fusion picks an existing parse from the rest of the n-best list, acting
similarly to a reranker. When fusion creates unique trees, they are
significantly better than the original 1-best trees (for the 11% subset
of WSJ 24, F1 scores are 85.5% with fusion and 84.1% without, p < 0.003).
This contrasts with McClosky et al. (2012) where novel predictions from
model combination (stacking) were worse than baseline performance. The
difference is that novel predictions with fusion better incorporate model
confidence whereas when stacking, a novel prediction is less trusted than
those produced by one or both of the base parsers.
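
The breakdown into 1-best, other n-best, and novel trees can be computed
with a small helper like the one sketched here. It assumes each tree is
available in some canonical comparable form (bracketed strings in the toy
example); the function names and the data are hypothetical.

```python
from collections import Counter


def classify_fused_tree(fused, nbest):
    """Label a fused tree as '1-best', 'other n-best', or 'novel'."""
    if not nbest:
        raise ValueError("n-best list is empty")
    if fused == nbest[0]:
        return "1-best"
    if fused in nbest[1:]:
        return "other n-best"
    return "novel"


def origin_breakdown(fused_trees, nbest_lists):
    """Fraction of sentences whose fused tree falls into each category."""
    counts = Counter(
        classify_fused_tree(fused, nbest)
        for fused, nbest in zip(fused_trees, nbest_lists)
    )
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data: three sentences, each with a 2-best list and one fused output.
nbest_lists = [
    ["(S (NP a) (VP b))", "(S (NP a b))"],
    ["(S (NP c) (VP d))", "(S (VP c d))"],
    ["(S (NP e) (VP f))", "(S (NP e f))"],
]
fused_trees = ["(S (NP a) (VP b))", "(S (VP c d))", "(S (X e) (VP f))"]
breakdown = origin_breakdown(fused_trees, nbest_lists)
```

Comparing canonical strings only works if whitespace and label
normalization are consistent across the parser output and the fused
output, so a real analysis would normalize trees first.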

Preliminary extensions: Here, we summarize two extensions to fusion which
have yet to show benefits. The first extension explores applying fusion
to dependency parsing. We explored two ways to apply fusion when starting
from constituency parses: (1) fuse constituents and then convert them to
dependencies and (2) convert to dependencies then fuse the dependencies
as in Sagae and Lavie (2006). Approach (1) does not provide any benefit
(LAS drops between 0.5% and 2.4%). This may result from fusion's
artifacts, including unusual unary chains or nodes with a large number of
children. It is possible that adjusting unary handling and the
precision/recall tradeoff may reduce these issues. Approach (2) provided
only modest benefits compared to those from constituency parsing fusion.
The largest LAS increase for (2) is 0.6% for the Stanford Parser, though
for Berkeley and Self-trained BLLIP, dependency fusion results in small
losses (-0.1% LAS). Two possible reasons are that the dependency baseline
is higher than its constituency counterpart and that some dependency
graphs from the n-best list are duplicates, which lowers diversity and
may need special handling, but this remains an open question.

While fusion helps on top of a self-trained parser, we also explored
whether a fused parser can self-train (McClosky et al., 2006). To test
this, we (1) parsed two million sentences with BLLIP (trained on WSJ),
(2) fused those parses, (3) added the fused parses to the gold training
set, and (4) retrained the parser on the expanded training set. The
resulting model did not perform better than a self-trained parsing model
that didn't use fusion.


from actual model combination techniques but at 
a fraction of the computational cost. Additionally, 
improvements are not limited to a single parser or 
domain. Fusion improves parser accuracy for six 
n-best parsers both in-domain and out-of-domain. 

Future work includes applying fusion to n-best dependency parsers and
additional (parser, language) pairs. We also intend to explore how to
better apply fusion to converted dependencies from constituency parsers.
Lastly, it would be interesting to adapt fusion to other structured
prediction tasks where n-best lists are available.